Combining Statistical Translation Techniques for Cross-Language Information Retrieval
Abstract
Cross-language information retrieval today is dominated by techniques that rely principally on context-independent token-to-token mappings despite the fact that state-of-the-art statistical machine translation systems now have far richer translation models available in their internal representations. This paper explores combination-of-evidence techniques using three types of statistical translation models: context-independent token translation, token translation using phrase-dependent contexts, and token translation using sentence-dependent contexts. Context-independent translation is performed using statistically-aligned tokens in parallel text, phrase-dependent translation is performed using aligned statistical phrases, and sentence-dependent translation is performed using those same aligned phrases together with an n-gram language model. Experiments on retrieval of Arabic, Chinese, and French documents using English queries show that no one technique is optimal for all queries, but that statistically significant improvements in mean average precision over strong baselines can be achieved by combining translation evidence from all three techniques. The optimal combination is, however, found to be resource-dependent, indicating a need for future work on robust tuning to the characteristics of individual collections.
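One simple way to picture combination of evidence is linear interpolation of the candidate translation distributions for a single query token. The sketch below is only an illustration under stated assumptions: the abstract does not say how the three models are combined, and the function name `combine_translations`, the weights, and the example probabilities are all hypothetical.

```python
from collections import defaultdict


def combine_translations(distributions, weights):
    """Linearly interpolate several token-translation distributions.

    distributions: list of dicts mapping a target-language token to P(target | source)
    weights: one non-negative weight per distribution (normalized internally)

    Assumption: linear interpolation is used here for illustration only;
    it is not claimed to be the combination method of the paper.
    """
    if len(distributions) != len(weights):
        raise ValueError("need exactly one weight per distribution")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    combined = defaultdict(float)
    for dist, weight in zip(distributions, weights):
        for target, prob in dist.items():
            # Each model contributes in proportion to its normalized weight.
            combined[target] += (weight / total) * prob
    return dict(combined)


# Hypothetical distributions for one English query token translated into French.
context_independent = {"banque": 0.6, "rive": 0.4}
phrase_dependent = {"banque": 0.9, "rive": 0.1}
sentence_dependent = {"banque": 1.0}

combined = combine_translations(
    [context_independent, phrase_dependent, sentence_dependent],
    weights=[1.0, 1.0, 2.0],  # hypothetical; would be tuned per collection
)
```

Because the abstract reports that the optimal combination is resource-dependent, weights like these would presumably have to be tuned separately for each collection rather than fixed in advance.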
Similar resources
Improved Cross-Language Retrieval using Backoff Translation
The limited coverage of available translation lexicons can pose a serious challenge in some cross-language information retrieval applications. We present two techniques for combining evidence from dictionary-based and corpus-based translation lexicons, and show that backoff translation outperforms a technique based on merging lexicons.
Combining lexical and statistical translation evidence for cross-language information retrieval
This paper explores how best to use lexical and statistical translation evidence together for Cross-Language Information Retrieval (CLIR). Lexical translation evidence is assembled from Wikipedia and from a large machine readable dictionary, statistical translation evidence is drawn from parallel corpora, and evidence from co-occurrence in the document language provides a basis for limiting the ...
CLEF Experiments at Maryland: Statistical Stemming and Backoff Translation
The University of Maryland participated in the CLEF 2000 multilingual task, submitting three official runs that explored the impact of applying language-independent stemming techniques to dictionary-based cross-language information retrieval. The paper begins by describing a cross-language information retrieval architecture based on balanced document translation. A four-stage backoff strategy for ...
CLEF Experiments at the University of Maryland: Statistical Stemming and Back-off Translation Strategies
The University of Maryland participated in the CLEF 2000 multilingual task, submitting three official runs that explored the impact of applying language-independent stemming techniques to dictionary-based cross-language information retrieval. The paper begins by describing a cross-language information retrieval architecture based on balanced document translation. A four-stage backoff strategy for i...
Technical issues of cross-language information retrieval: a review
This paper reviews state-of-the-art techniques and methods for enhancing effectiveness of cross-language information retrieval (CLIR). The following research issues are covered: (1) matching strategies and translation techniques, (2) methods for solving the problem of translation ambiguity, (3) formal models for CLIR such as application of the language model, (4) the pivot language approach, (5...